PANDITplus: toward better integration of evolutionary view on molecular sequences with supplementary bioinformatics resources

Authors

  • Slavica Dimitrieva
  • Maria Anisimova
Abstract

Recent comparative genomic and other large-scale bioinformatics studies have increasingly used gene annotations, functional classifications, and complementary data from the emerging “-omics” disciplines. Such analyses have a better chance of uncovering hidden patterns in complex, multidimensional, and heterogeneous biological systems data. On the other hand, inferences from such studies are extremely sensitive to data samples and data quality, and are more difficult to compare or replicate because the supplementary data sources differ and are at times not publicly available. As a contribution toward the unification and integration of good-quality data from heterogeneous bioinformatics resources, we present here an integrated data bank, PANDITplus. It is built as an extension of PANDIT, the database of Pfam alignments and phylogenetic trees for known protein domains and families spanning lineages from the three domains of life. PANDITplus is a relational database containing information on functional categories, metabolic pathways, protein–protein interactions, disease associations, gene expression, and three-dimensional structure, as well as estimates from evolutionary analyses of selective pressures. A user-friendly interface enables customized queries and fast data access. We recommend PANDITplus as a common bioinformatics platform for testing evolutionary hypotheses that go beyond inferences from molecular data alone by incorporating supplementary gene information. Equally, PANDITplus provides an excellent resource for the development, testing, and comparison of statistical models of substitution and of probabilistic dependencies between a molecular sequence and its various attributes. The database may be accessed via http://www.panditplus.org.
Introduction

With advances in experimental techniques, recent years have seen rapid growth not only in molecular sequence data but also in complementary gene and protein information such as gene expression, numbers of protein–protein interactions, three-dimensional structure, etc. Gene annotation is likewise gaining accuracy and speed. The large-scale availability of such multifaceted data has led to a common trend of combining supplementary gene information with more conventional analyses of molecular sequences in order to reach more insightful conclusions. However, the results of many such studies are not easy to compare, because they use different data sources, produced by different laboratories and not always available to the public. Data quality and biases in gene or species samples may influence the inference. Whenever possible, it is desirable to test biological hypotheses using the same well-maintained and well-structured integrated database solution.

Here we present a relational database, PANDITplus, that makes a step towards the integration of data from a variety of reliable and curated bioinformatics sources. Along with DNA and amino acid sequence data for homologs, PANDITplus provides access to precomputed estimates from evolutionary codon models, data on protein interactions, functional and chemical pathway annotation, gene expression, and association with disease.

The underlying database, PANDIT, contains homologous amino acid and protein-coding sequence alignments from Pfam, a comprehensive and accurate collection of protein domains and families. The idea behind the PANDIT database was to encourage “evolution-centric” analyses of protein domains and families, based on reliable sets of HMM-based alignments and associated phylogenetic trees. Both Pfam and PANDIT have been updated since their first publications and have become popular for large-scale studies of protein-coding genes and as testing platforms.
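To make the idea of an integrated relational resource more concrete, the following is a minimal sketch of how sequence-family records, functional annotation, and selective-pressure estimates could be combined in a single query. It does not reproduce the actual PANDITplus schema or interface, which are not described here: the table names (`family`, `go_annotation`, `selection_estimate`), the column names, the toy identifiers, and the use of a single `omega` value (the dN/dS ratio commonly reported by codon models) per family are all assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema.

    None of these tables or values come from PANDITplus; they are toy data.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE family (
            family_id   TEXT PRIMARY KEY,
            description TEXT
        );
        CREATE TABLE go_annotation (
            family_id TEXT REFERENCES family(family_id),
            go_term   TEXT
        );
        CREATE TABLE selection_estimate (
            family_id TEXT REFERENCES family(family_id),
            omega     REAL  -- assumed: one dN/dS estimate per family
        );
        """
    )
    conn.executemany(
        "INSERT INTO family VALUES (?, ?)",
        [("FAM_A", "toy family A"), ("FAM_B", "toy family B"), ("FAM_C", "toy family C")],
    )
    conn.executemany(
        "INSERT INTO go_annotation VALUES (?, ?)",
        [("FAM_A", "GO:toy_binding"), ("FAM_B", "GO:toy_binding"), ("FAM_C", "GO:toy_transport")],
    )
    conn.executemany(
        "INSERT INTO selection_estimate VALUES (?, ?)",
        [("FAM_A", 0.10), ("FAM_B", 0.45), ("FAM_C", 0.20)],
    )
    conn.commit()
    return conn


def families_by_function(conn, go_term, min_omega):
    """Return (family_id, omega) pairs for families annotated with go_term
    whose omega is at least min_omega, ordered from highest to lowest omega."""
    cursor = conn.execute(
        """
        SELECT f.family_id, s.omega
        FROM family AS f
        JOIN go_annotation      AS g ON g.family_id = f.family_id
        JOIN selection_estimate AS s ON s.family_id = f.family_id
        WHERE g.go_term = ? AND s.omega >= ?
        ORDER BY s.omega DESC
        """,
        (go_term, min_omega),
    )
    return cursor.fetchall()


conn = build_demo_db()
print(families_by_function(conn, "GO:toy_binding", 0.05))
```

The point of the sketch is the join: once heterogeneous attributes share a family identifier, a hypothesis such as "families with a given function show elevated omega" can be posed as one reproducible query rather than assembled from separately downloaded files.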
Classic examples of the use of Pfam/PANDIT for evolutionary model development include studies presenting novel DNA, amino acid, and codon substitution models, such as WAG, LG, ECM, SDT, and THMM. Recently, the accuracy of a multiple protein-coding alignment method (implemented in MAGNOLIA) was tested on PANDIT data. Searching for data biases and universal trends also requires large, well-maintained collections of data. Multiple alignments from PANDIT and functional classification from Gene Ontology (GO) have been used to study trends relating to positive selection, and to verify and extend the complexity hypothesis. Bofkin and Goldman used PANDIT to demonstrate that substitution patterns at the three codon positions often are strikingly different, thus necessitating suitable statistical treatment during phylogenetic analyses. Several authors developing statistical methodology and software for bioinformatics have recently stressed the importance of maintaining resources like PANDIT. For example, PANDIT is included as a test database by the xREI tool for phylo-grammar visualization and development, which is currently a promising area in evolutionary methodology. Indeed, issues of model development, validation, and comparison may be better assessed based on a standardized data collection such as that jointly provided by PANDIT and PANDITplus. The inclusion of supplementary gene information allows for better classification, filtering, and pattern discovery, contributing to the development of better statistical models. This potentially leads to greater predictive power and a better understanding of underlying evolutionary


Similar resources

An Evolutionary and Phylogenetic Study of the BMP15 Gene

DNA sequence data contains a wealth of biologically useful information. Recent innovations in DNA sequencing technology have greatly increased our capacity to determine massive amounts of nucleotide sequences. These sequences can be used to specify the characteristics of different regions, interpret the evolutionary relationships between categorized groups, likelihood of performing multiple com...


Molecular and Bioinformatics Analysis of Allelic Diversity in IGFBP2 Gene Promoter in Indigenous Makuee and Lori-Bakhtiari Sheep Breeds

The aim of this study was to perform molecular and bioinformatics analysis of IGFBP2 gene promoter in association with some economic traits in indigenous Makuee (MS) and Lori-Bakhtiari (LB) breeds. DNA was extracted from blood samples of 120 MS and 200 LB and a 297 bp fragment from the upstream sequences of studied gene was amplified and genotyped by single-strand conformational polymo...


Contribution to the molecular systematics of the genus Capoeta from the south Caspian Sea basin using mitochondrial cytochrome b sequences (Teleostei: Cyprinidae)

Traditionally, Capoeta populations from the southern Caspian Sea basin have been considered to be Capoeta capoeta gracilis. A study of the phylogenetic relationships of Capoeta species using mitochondrial cytochrome b gene sequences shows that the Capoeta population from the southern Caspian Sea basin is a distinct species and receives strong support (posterior probability of 100%). Based on the tree topologie...


Indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment

MOTIVATION: A quantitative study of molecular evolutionary events such as substitutions, insertions and deletions from closely related genomes requires (1) an accurate multiple sequence alignment program and (2) a method to annotate the insertions and deletions that explain the 'gaps' in the alignment. Although the former requirement has been extensively addressed, the latter problem has receive...


Quantitative Comparison of Tree Pairs Resulted from Gene and Protein Phylogenetic Trees for Sulfite Reductase Flavoprotein Alpha-Component and 5S rRNA and Taxonomic Trees in Selected Bacterial Species

Introduction: FAD is the cofactor of the FAD-FR protein family. Sulfite reductase flavoprotein alpha-component is one of the main enzymes of this family. Because of this enzyme's applications in biotechnology and industry, it was chosen as the subject of evolutionary studies in 19 specific species. Method: Gene and protein sequences of sulfite reductase flavoprotein alpha-component, 5S rRNA sequence...



Publication date: 2010